1 Session S6.2: compilers and program analysis: PACT HDL: a C compiler targeting ASICs and
FPGAs with power and performance optimizations

Alex Jones, Debabrata Bagchi, Satrajit Pal, Xiaoyong Tang, Alok Choudhary, Prith Banerjee 
October 2002 Proceedings of the 2002 international conference on Compilers, architecture, and 
synthesis for embedded systems 






Chip fabrication technology continues to plunge deeper into sub-micron levels requiring hardware 
designers to utilize ever-increasing amounts of logic and shorten design time. Toward that end, high- 
level languages such as C/C++ are becoming popular for hardware description and synthesis in order to 
more quickly leverage complex algorithms. Similarly, as logic density increases due to technology, 
power dissipation becomes a progressively more important metric of hardware design. PACT HDL, a C 
to ... 

Keywords: ASIC, FPGA, FSM, HDL, IP, SoC, VHDL, Verilog, compiler, high-performance, levelization, 
low-power, pipelining, synthesis 



2 Performance counters and state sharing annotations: a unified approach to thread locality 
Boris Weissman 

October 1998 Proceedings of the eighth international conference on Architectural support for
programming languages and operating systems, Volume 33, 32 Issue 11, 5






This paper describes a combined approach for improving thread locality that uses the hardware 
performance monitors of modern processors and program-centric code annotations to guide thread
scheduling on SMPs. The approach relies on a shared state cache model to compute expected thread
footprints in the cache on-line. The accuracy of the model has been analyzed by simulations involving a
set of parallel applications. We demonstrate how the cache model can be used to implement several 
practical loca ... 

3 Track 5: supercomputing (part 2): A case for a working-set-based memory hierarchy 
Steve Carr, Soner Onder 

May 2005 Proceedings of the 2nd conference on Computing frontiers 




Modern microprocessor designs continue to obtain impressive performance gains through increasing 
clock rates and advances in the parallelism obtained via micro-architecture design. Unfortunately, 
corresponding improvements in memory design technology have not been realized, resulting in latencies 
of over 100 cycles between processors and main memory. This ever-increasing gap in speed has pushed 
the current memory-hierarchy approach to its limit. Traditional approaches to memory-hierarchy
manageme ... 



Keywords: cache design, loop tiling 



4 An optimal memory allocation scheme for scratch-pad-based embedded systems 
Oren Avissar, Rajeev Barua, Dave Stewart 

November 2002 ACM Transactions on Embedded Computing Systems (TECS), Volume 1 Issue 1




This article presents a technique for the efficient compiler management of software-exposed 
heterogeneous memory. In many lower-end embedded chips, often used in microcontrollers and DSP
processors, heterogeneous memory units such as scratch-pad SRAM, internal DRAM, external DRAM, 
and ROM are visible directly to the software, without automatic management by a hardware caching 
mechanism. Instead, the memory units are mapped to different portions of the address space. Caches 
are avoided due to the ... 

Keywords: Memory, allocation, embedded, heterogeneous, storage 



5 Cool-Cache: A compiler-enabled energy efficient data caching framework for embedded/multimedia
processors

Osman S. Unsal, Raksit Ashok, Israel Koren, C. Mani Krishna, Csaba Andras Moritz 

August 2003 ACM Transactions on Embedded Computing Systems (TECS), Volume 2 Issue 3


The unique characteristics of multimedia/embedded applications dictate media-sensitive architectural 
and compiler approaches to reduce the power consumption of the data cache. Our goal is exploring 
energy savings for embedded/multimedia workloads without sacrificing performance. Here, we present 
two complementary media-sensitive energy-saving techniques that leverage static information. While 
our first technique is applicable to existing architectures, in our second technique we adopt a more 
rad ... 

Keywords: Low-power design, cache partitioning, compiler-architecture interaction, tagless caching 



6 Multimedia and graphics: Cool-cache for hot multimedia 

Osman S. Unsal, Raksit Ashok, Israel Koren, C. Mani Krishna, Csaba Andras Moritz 
December 2001 Proceedings of the 34th annual ACM/IEEE international symposium on
Microarchitecture




We claim that the unique characteristics of multimedia applications dictate media-sensitive architectural 
and compiler approaches to reduce the power consumption of the data cache. Our motivation is 
exploring energy savings for real-time multimedia workloads without sacrificing performance. Here, we 
present two complementary media-sensitive energy-saving techniques that leverage static information. 
While our first technique is applicable to existing architectures, in our second technique we adop ... 

7 The benefits and costs of DyC's run-time optimizations

Brian Grant, Markus Mock, Matthai Philipose, Craig Chambers, Susan J. Eggers

September 2000 Transactions on Programming Languages and Systems (TOPLAS), Volume 22 Issue 5




DyC selectively dynamically compiles programs during their execution, utilizing the run-time-computed 
values of variables and data structures to apply optimizations that are based on partial evaluation. The 
dynamic optimizations are preplanned at static compile time in order to reduce their run-time cost; we 
call this staging. DyC's staged optimizations include (1) an advanced binding-time analysis that supports 
polyvariant specialization (enabling both single-way and multi ... 

Keywords: dynamic compilation, specialization 



8 Coupling compiler-enabled and conventional memory accessing for energy efficiency 
Raksit Ashok, Saurabh Chheda, Csaba Andras Moritz 

May 2004 ACM Transactions on Computer Systems (TOCS), Volume 22 Issue 2 




This article presents Cool-Mem, a family of memory system architectures that integrate conventional 
memory system mechanisms, energy-aware address translation, and compiler-enabled cache 
disambiguation techniques, to reduce energy consumption in general-purpose architectures. The 
solutions provided in this article leverage on interlayer tradeoffs between architecture, compiler, and 
operating system layers. Cool-Mem achieves power reduction by statically matching memory operations 
with energy-eff ... 

Keywords: Energy efficiency, translation buffers, virtually addressed caches 



9 Compiler transformations for high-performance computing

David F. Bacon, Susan L. Graham, Oliver J. Sharp 

December 1994 ACM Computing Surveys (CSUR), Volume 26 Issue 4



In the last three decades a large number of compiler transformations for optimizing programs have
been implemented. Most optimizations for uniprocessors reduce the number of instructions executed by 
the program using transformations based on the analysis of scalar quantities and data-flow techniques. 
In contrast, optimizations for high-performance superscalar, vector, and parallel processors maximize 
parallelism and memory locality with transformations that rely on tracking the properties o ... 

Keywords: compilation, dependence analysis, locality, multiprocessors, optimization, parallelism, 
superscalar processors, vectorization 



10 CRL: high-performance all-software distributed shared memory 
K. L. Johnson, M. F. Kaashoek, D. A. Wallach 



December 1995 ACM SIGOPS Operating Systems Review, Proceedings of the fifteenth ACM symposium
on Operating systems principles, Volume 29 Issue 5



11 Retargetable estimation scheme for DSP architecture selection
Naji Ghazal, Richard Newton, Jan Rabaey 

January 2000 Proceedings of the 2000 conference on Asia South Pacific design automation 




12 Programming data parallel algorithms on distributed memory using Kali
Charles Koelbel, Piyush Mehrotra

June 1991 Proceedings of the 5th international conference on Supercomputing




13 High performance Fortran language specification
CORPORATE Rice University 

December 1993 ACM SIGPLAN Fortran Forum, Volume 12 Issue 4 




(PART I) Fortran Forum is reprinting this High Performance Fortran Language Specification over several
issues. The current issue is devoted to the first four chapters of the HPFF Language Specification.
Remaining chapters of the HPFF Language Specification, and the HPFF Journal of Development, will be
printed in installments in future issues of Fortran Forum.

14 Comparing data forwarding and prefetching for communication-induced misses in shared-memory 
MPs 

David Koufaty, Josep Torrellas

July 1998 Proceedings of the 12th international conference on Supercomputing




15 Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software 
Lars Wehmeyer, Peter Marwedel 

March 2005 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1 






Safety-critical embedded systems having to meet real-time constraints are expected to be highly 
predictable in order to guarantee at design time that certain timing deadlines will always be met. This 
requirement usually prevents designers from utilizing caches due to their highly dynamic, thus hardly 
predictable behavior. The integration of scratchpad memories represents an alternative approach which 
allows the system to benefit from a performance gain comparable to that of caches while at the sam ... 

16 An evaluation of staged run-time optimizations in DyC
Brian Grant, Matthai Philipose, Markus Mock, Craig Chambers, Susan J. Eggers
May 1999 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 1999 conference on
Programming language design and implementation, Volume 34 Issue 5




Previous selective dynamic compilation systems have demonstrated that dynamic compilation can 
achieve performance improvements at low cost on small kernels, but they have had difficulty scaling to 
larger programs. To overcome this limitation, we developed DyC, a selective dynamic compilation 
system that includes more sophisticated and flexible analyses and transformations. DyC is able to
achieve good performance improvements on programs that are much larger and more complex than the 
kernels. We ... 

17 Exploiting dual data-memory banks in digital signal processors

Mazen A. R. Saghir, Paul Chow, Corinna G. Lee 

September 1996 Proceedings of the seventh international conference on Architectural support for
programming languages and operating systems, Volume 31, 30 Issue 9, 5


Over the past decade, digital signal processors (DSPs) have emerged as the processors of choice for 
implementing embedded applications in high-volume consumer products. Through their use of
specialized hardware features and small chip areas, DSPs provide the high performance necessary for 
embedded applications at the low costs demanded by the high-volume consumer market. One feature 
commonly found in DSPs is the use of dual data-memory banks to double the memory system's 
bandwidth. When coupled ... 

18 Java annotation-aware just-in-time (AJIT) compilation system
Ana Azevedo, Alex Nicolau, Joe Hummel 

June 1999 Proceedings of the ACM 1999 conference on Java Grande 




19 Using an architectural knowledge base to generate code for parallel computers
Anthony E. Terrano, Stanley M. Dunn, Joseph E. Peters
September 1989 Communications of the ACM, Volume 32 Issue 9




The authors present a reconfigurable compiler for distributed memory parallel computers that performs 
automatic program partitioning, mapping, and communication code generation under the guidance of 
directives supplied by the programmer. 

20 Bitwidth analysis with application to silicon compilation
Mark Stephenson, Jonathan Babb, Saman Amarasinghe 

May 2000 ACM SIGPLAN Notices, Proceedings of the ACM SIGPLAN 2000 conference on
Programming language design and implementation, Volume 35 Issue 5




This paper introduces Bitwise, a compiler that minimizes the bitwidth (the number of bits used to
represent each operand) for both integers and pointers in a program. By propagating static
information both forward and backward in the program dataflow graph, Bitwise frees the programmer 
from declaring bitwidth invariants in cases where the compiler can determine bitwidths automatically. 
Because loop instructions comprise the bulk of dynamically executed instructions, Bitw ... 